Discovering Objects in Dynamically-Generated Web Pages
Authors
Abstract
As the web grows, more and more content is being hidden from the reach of traditional search engines. In this paper, we present THOR, a scalable and efficient tool to mine objects from this hidden web. With precision and recall over 90%, THOR automatically extracts objects of interest from dynamically-generated web pages. Customized object-identification algorithms are then applied to locate the "interesting" objects in each page. We show that dynamically-generated pages tend to be a homogeneous subset of pages found on the Web, and that these pages may be separated into distinct clusters of structurally-similar pages. Using this homogeneity across clusters along with traditional information retrieval techniques, we propose a two-phase clustering scheme consisting of a page clustering algorithm and a fragment clustering algorithm. Using this scheme, we can identify object-rich fragments of each page with an average of over 90% precision and over 95% recall.
Similar Resources
Inférer des Objets Sémantiques du Web Structuré
This thesis focuses on the extraction and analysis of Web data objects, investigated from different points of view: temporal, structural, semantic. We first survey different strategies and best practices for deriving temporal aspects of Web pages, together with a more in-depth study on Web feeds for this particular purpose, and other statistics. Next, in the context of dynamically-generated Web...
Analysis of navigation behaviour in web sites integrating multiple information
The analysis of web usage has mostly focused on sites composed of conventional static pages. However, huge amounts of information available in the web come from databases or other data collections and are presented to the users in the form of dynamically generated pages. The query interfaces of such sites allow the specification of many search criteria. Their generated results support navigatio...
Adaptive Web Prefetching Scheme using Link Anchor Information
Web prefetching provides an effective mechanism to mitigate the user-perceived latency when accessing web pages. The content of web pages provides useful information for generating the predictions, which are used to prefetch the web objects for satisfying the user's future requests. In this paper, we propose a fuzzy-logic-based web prefetching scheme that generates effective predictions for pr...
Discovering Test Set Regularities in Relational Domains
Machine learning typically involves discovering regularities in a training set, then applying these learned regularities to classify objects in a test set. In this paper we present an approach to discovering additional regularities in the test set, and show that in relational domains such test set regularities can be used to improve classification accuracy beyond that achieved using the trainin...
A Categorical Model for Discovering Latent Structure in Social Annotations
The advent of social tagging systems has enabled a new community-based view of the Web in which objects like images, videos, and Web pages are annotated by thousands of users. Understanding the emergent semantics inherent in the socially-generated collection of annotations has important research implications for information discovery and knowledge sharing. To this end, we propose a novel probab...